Abstract

A major challenge in microarray classification is that the number of features is typically orders of magnitude larger
than the number of examples. In this paper, we propose a novel feature filter algorithm to select the feature 
subset with maximal discriminative power and minimal redundancy by solving a quadratic objective function with 
binary integer constraints. To improve the computational efficiency, the binary integer constraints are relaxed and a 
low-rank approximation to the quadratic term is applied. The proposed feature selection algorithm was extended 
to solve multi-task microarray classification problems. We compared the single-task version of the proposed feature 
selection algorithm with 9 existing feature selection methods on 4 benchmark microarray data sets. The empirical 
results show that the proposed method achieved the most accurate predictions overall. We also evaluated the 
multi-task version of the proposed algorithm on 8 multi-task microarray datasets. The multi-task feature selection 
algorithm resulted in significantly higher accuracy than when using the single-task feature selection methods. 



Background 

Microarray technology has the ability to simultaneously 
measure expression levels of thousands of genes for a 
given biological sample, which is classified into one of 
the several categories (e.g., cancer vs. control tissues). 
Each sample is represented by a feature vector of gene 
expressions obtained from a microarray. Using a set of 
microarray samples with known class labels, the goal is 
to learn a classifier able to classify a new tissue sample 
based on its microarray measurements. A typical micro- 
array classification data set contains a limited number of 
labeled examples, ranging from only a few to several 
hundred. Building a predictive model from such small- 
sample high-dimensional data is a challenging problem 
that has received significant attention in the machine
learning and bioinformatics communities. To reduce the
risk of over-fitting, a typical strategy is to select a small 
number of features (i.e., genes) before learning a classifi- 
cation model. As such, feature selection [1,2] becomes 
an essential technique in microarray classification.

There are several reasons for feature selection in
microarray data, in addition to improving the classifier's 
generalization ability. First, the selected genes might be 
of interest to domain scientists interested in identifying 
disease biomarkers. Second, building a classifier from a 
small number of features could result in an easily inter- 
pretable model that could give important clues to biolo- 
gists. Depending on how the feature selection process is 
combined with the model learning process, feature selection
techniques can be organized into three categories. (1) 
Filter methods [3] are independent of the learning algo- 
rithm. (2) Wrapper methods [4] are coupled with the 
learning algorithm using heuristics such as forward 
selection and backward elimination. (3) Embedded 
methods [5,6] integrate feature selection as a part of the 
classifier training. Both the wrapper and the embedded 
methods effectively introduce hyper-parameters that 
require computationally costly nested cross-validation 
and increase the likelihood of over-fitting. Feature filter
methods are very popular because they are typically 
conceptually simple, computationally efficient, and 
robust to over-fitting. These properties also explain why 
the filter methods are more widely used than the other 
two approaches in microarray data classification.

Traditional filter methods rank the features based on
their correlation with the class label and then select the 
top ranked features. The correlation can be measured by 
statistical tests (e.g., t-test) or by information-theoretic criteria
such as mutual information. The filter methods
easily scale up to high dimensional data and can be used 
in conjunction with any supervised learning algorithm. 
However, because the traditional filter methods assess
each feature independently, highly correlated features 
tend to have similar rankings and tend to be selected 
jointly. Using redundant features could result in low clas- 
sification accuracy. As a result, one common improve- 
ment for filter methods is to reduce redundancy between 
selected features. For example, minimal-redundancy- 
maximal-relevance (mRMR) proposed by [7] selects the 
feature set with both maximal relevance to the target 
class and minimal redundancy among the selected fea- 
ture set. Because of the high computational cost of con- 
sidering all possible feature sets, the mRMR algorithm 
selects features greedily, minimizing their redundancy 
with features chosen in previous steps and maximizing 
their relevance to the target class. 
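
The greedy strategy described above can be sketched as follows. This is a minimal illustration, not the implementation from [7]: it assumes the "difference" form of the criterion (relevance minus mean redundancy with the already selected features), and the function name `greedy_mrmr` as well as the toy relevance and similarity values are hypothetical.

```python
def greedy_mrmr(relevance, similarity, m):
    """Greedy selection in the spirit of mRMR (difference form, assumed here).

    relevance[j]     : relevance of feature j to the class label (higher is better)
    similarity[j][s] : redundancy between features j and s (higher is more redundant)
    m                : number of features to select
    """
    remaining = list(range(len(relevance)))
    selected = []
    while remaining and len(selected) < m:
        def score(j):
            # The first pick is by relevance only; later picks are penalized
            # by the mean redundancy with features chosen in previous steps.
            if not selected:
                return relevance[j]
            mean_red = sum(similarity[j][s] for s in selected) / len(selected)
            return relevance[j] - mean_red
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data (hypothetical): features 0 and 1 are near-duplicates.
relevance = [0.90, 0.85, 0.60, 0.10]
similarity = [
    [1.00, 0.95, 0.10, 0.05],
    [0.95, 1.00, 0.10, 0.05],
    [0.10, 0.10, 1.00, 0.05],
    [0.05, 0.05, 0.05, 1.00],
]
top_by_relevance = sorted(range(4), key=lambda j: -relevance[j])[:2]
mrmr_pick = greedy_mrmr(relevance, similarity, 2)
print(top_by_relevance, mrmr_pick)
```

On this toy input, plain ranking keeps both near-duplicate features, while the greedy redundancy-aware selection replaces the second one with a less correlated feature.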

A common critique of popular feature selection filters 
is that they are typically based on relatively simple heuristics.
To address this concern, recent research has resulted in
more principled formulations of feature filters. For example,
algorithms proposed in [8] and [9] attempt to select
the feature subset with maximal relevance and minimal 
redundancy by solving a constrained quadratic optimiza- 
tion problem (QP). The objective used by [8] is a combi- 
nation of a quadratic term and a linear term. The 
redundancy between feature pairs is measured by the 
quadratic term and the relevance between features and 
class label is measured by the linear term. The features 
are ranked based on a weight vector obtained by solving 
a QP problem. The main limitation of this method is that 
the relevance between a feature and the class label is 
measured by either Pearson correlation or mutual infor- 
mation. However, Pearson correlation assumes normal 
distribution of the measurements, which might not be 
appropriate to measure correlation between numerical 
features and binary target. The mutual information 
requires using discrete variables and is sensitive to discre- 
tization. The objective used by [9] contains only one 
quadratic term. This quadratic term consists of two 
parts: one measures feature relevance using mutual infor- 
mation between features and the class label, and another 
measures feature redundancy using mutual information 
between each feature pair. However, the square matrix in 
the proposed quadratic term is not positive semi-definite. 
Thus, the resulting optimization problem is not convex 
and could result in poor local optima. 

In this paper, we propose a novel feature filter method 
to find the feature subset which maximizes the inter-class
separability and intra-class tightness, and minimizes the
pairwise correlations between selected features. We formulate
the problem as a quadratic program with binary
integer constraints. For high-dimensional microarray
data, solving the proposed quadratic program with
binary integer constraints requires high time
and space cost. Therefore, we relax the binary integer constraints
and apply a low-rank approximation to the
quadratic term in the objective function. The resulting 
objective function can be efficiently solved to obtain a 
small subset of features with maximal relevance and mini- 
mal redundancy. 

In many real-life microarray classification problems, the 
size of the given microarray dataset is particularly small 
(e.g., we might have fewer than 10 labeled high-dimensional
examples). In this case, even the most carefully designed 
feature selection algorithms are bound to underperform. 
Probably the only remedy is to borrow strength from 
external microarray datasets. Recent research [10,11] illus- 
trates that multi-task feature selection algorithms can 
improve the classification accuracy. The multi-task feature 
selection algorithms select the informative features jointly 
across many different microarray classification data sets. 
Following this observation, we extend our feature selection 
algorithm to the multi-task microarray classification setup. 

The contributions of this paper can be summarized as 
follows. (1) We propose a novel gene filter method which 
can obtain a feature subset with maximal discriminative 
power and minimal redundancy; (2) The globally optimal 
solution can be found efficiently by relaxing the integer 
constraints and using a low-rank approximation techni- 
que; (3) We extend our feature selection method to 
multi-task classification setting; (4) The experimental 
results show our algorithms achieve higher accuracy than 
the existing filter feature selection methods, both in sin- 
gle-task learning and multi-task settings. 

Results and discussion 

We compared our proposed feature selection algorithm with 9
representative feature selection filters. The first 6 are 
standard feature selection filters: Pearson Correlation 
(PC), ChiSquare [3], GINI, Infogain, Kruskal-Wallis test 
and Relief [12]. They rank the features based on different 
criteria that measure correlation between each feature 
and class label. The remaining 3 are the state-of-the-art 
feature selection methods which are able to remove 
redundant features: mRMR [7], QPFS [8] and SASMIF 
[9]. The feature similarity for both QPFS and our algo- 
rithm was measured by Pearson correlation. For fair 
comparison, for the SASMIF method we used top m 
ranked features. To balance the effect of feature relevance 
and feature redundancy, the parameter $\lambda$ in (9) was set to
— — — —. The low-rank parameter k was set to 0.1 · M,
as suggested in [13]. Our algorithm is denoted as ST-BIP
for the single-task version and MT-BIP for the multi-task version.

Given the selected features, we used LIBLINEAR [14] 
to train the linear SVM model. The linear SVM model 
was chosen because previous studies [5] showed that SVM
classifiers can be very accurate on microarray data. The
regularization parameter C of LIBLINEAR was chosen 
among {10'^, 10'*, 10^}. For the experiments in the 
single-task scenario, we used nested 5-fold cross-validation
to select the optimal regularization parameter. For 
experiments in multi-task learning scenario, it was too 
time consuming to use the nested cross-validation to 
select the regularization parameter. Thus, we simply 
fixed the regularization parameter to 1 in the multi-task 
experiments. 

Single task feature selection 

In this section, we evaluate our proposed feature selection 
algorithm for single-task learning using four benchmark 
microarray gene expression cancer datasets: (1) Colon 
dataset [15] containing 62 samples, 40 tumor and 22 nor- 
mal samples; (2) Lung dataset [16] containing 86 samples 
coming from 24 patients that died and 62 that survived; 
(3) Diffuse B-cell Lymphoma (DLBCL) dataset [17] containing
77 samples, 58 coming from DLBCL patients and
19 from B-cell lymphoma patients; (4) Myeloma dataset
[18] containing 173 samples, 137 coming from patients 
with bone lytic lesions and 36 from control patients. We 
summarize the characteristics of these datasets in Table 1. 

For each microarray dataset, we randomly selected
20 positive and 20 negative examples (except for choosing
15 positive and 15 negative in the DLBCL dataset) as the
training set and the rest as the test set. Due to the class 
imbalance in test sets, we used AUC, the area under 
the Receiver Operating Characteristic (ROC) curve, to 
evaluate the performance. The average AUC over
10 repetitions of the experiment with different random splits
into training and test sets is reported in Table 2. We
compared the AUC of different feature selection
algorithms for m = 20, 50, 100, 200, 1000. For each
dataset, the best AUC score among all methods was 
emphasized in bold. As shown in Table 2, our proposed 
method achieved the highest accuracy on Colon and 
DLBCL datasets. On the Myeloma dataset, it had the 
highest accuracy when m = 100 and 1000 and had the 
second highest accuracy when m = 20, 50 and 200. On 
the Lung dataset, our algorithm was ranked in the upper 
half of the competing algorithms. The last column in
Table 2 shows the average AUC score across the four
datasets. Our method achieved the highest average
AUC score. The next two most successful feature selection
algorithms were Relief and QPFS. mRMR had somewhat
lower accuracy, comparable to simple filters such as
PC, ChiSquare, GINI and InfoGain. SASMIF was considerably
less accurate, while KW was the least successful.

Table 1 Summary of the Microarray Datasets

             Colon         Lung          DLBCL         Myeloma
# Samples    62 (40/22)    86 (24/62)    77 (58/19)    173 (137/36)
# Genes      2000          5469          5469          12558
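
The AUC used as the evaluation measure in this section can be computed directly from its pairwise definition. The sketch below is only an illustration: the function name `auc_score` is hypothetical, ties between a positive and a negative score are counted as one half (an assumption), and the labels and scores are made-up toy values rather than results from the experiments.

```python
def auc_score(labels, scores):
    """AUC as the fraction of (positive, negative) pairs that are ranked correctly.

    labels : iterable of 1 (positive) or 0 (negative)
    scores : classifier scores, higher means "more positive"
    Ties between a positive and a negative score count as 1/2 (assumption).
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y != 1]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Imbalanced toy test set: 2 positives and 4 negatives.
labels = [1, 1, 0, 0, 0, 0]
scores = [0.9, 0.4, 0.5, 0.3, 0.2, 0.1]
auc = auc_score(labels, scores)
print(auc)
```

Because the measure only depends on how positives are ranked relative to negatives, it is not inflated by the majority class in an imbalanced test set.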

Multi-task feature selection 

In this section, we evaluate our proposed feature selec- 
tion algorithm for multi-task learning. We used 8 cancer 
related binary microarray classification datasets published 
in [19]. The data are summarized in Table 3. As shown in 
Table 3, the size of the 8 microarray datasets was very 
small. The single-task feature selection algorithms are 
not expected to perform well because there might be 
insufficient information even when simple feature selec- 
tion filters are used. In contrast, our multi-task feature 
selection algorithm is expected to improve the accuracy 
by borrowing strength across multiple microarray 
datasets. 

For each microarray dataset, we randomly selected N =
2, 3, 4, 5 positive and the same number of negative examples
as the training data and used the rest as the test data.
We show the results for m = 100 in this section. The aver- 
age AUC across these 8 microarray datasets is shown in 
Figure 1. The results clearly show the multi-task version of 
our proposed algorithm was the most successful algorithm 
overall. 

To gain a deeper understanding of why
the multi-task feature selection algorithm obtained better
overall accuracy than the single-task feature selection algorithms,
we show the AUC score on each individual microarray
dataset for N = 3 in Table 4. We can see that
the single-task version of our feature selection algorithm
had the highest overall accuracy among the single-task
benchmarks, a result consistent with Table 2. The multi-task
version of our algorithm has a higher AUC than its single-task
version on 4 datasets, and its average AUC is about
1.5% higher. In the other 4 cases (the Colon, Lung, Pancreas and Renal
datasets) we observe negative transfer, where
the accuracy drops. How to prevent negative transfer in
multi-task feature selection is an interesting
topic for our future research.

Gene-annotation enrichment analysis for multi-task 
microarray datasets 

The multi-task experimental results show that the accuracies
obtained by MT-BIP are better overall than those of the
single-task feature filters. We therefore performed functional
annotation of the genes selected by MT-BIP. In the MT-BIP filter,
only one selected gene list is obtained for all 8 different
types of cancer. Given this gene list, the top 10 enriched
GO terms were obtained using DAVID Bioinformatics
Resources [20]. The top 10 enriched GO terms based on
the MT-BIP selected gene list are shown in Table 5. In this
table, hits denotes the number of genes in the selected gene
list that are associated with the specific GO
term. The p-value was obtained by the Fisher exact test,
which is used to measure the gene enrichment in annotation
terms. After obtaining the enriched GO terms, we used
the Comparative Toxicogenomics Database (CTD) [21]
to check whether there is an association between each GO
term and each cancer type. The last column in Table 5
shows the disease association for each GO term. The
datasets are ordered as Bladder (B), Breast (B), Colon (C),
Lung (L), Pancreas (P), Prostate (P), Renal (R) and Uterus
(U). If a GO term is associated with the given type of
cancer, we write the letter of that cancer. Otherwise, we
put the symbol # in that position. The
enriched GO terms based on MT-BIP tend to be associated with
many different types of cancer. As shown in Table 5,
GO:0005856 (cytoskeleton) and GO:0005886 (plasma
membrane) were associated with 7 different cancers.
GO:0030054 (cell junction), GO:0015629 (actin cytoskeleton)
and GO:0032403 (protein complex binding) were
associated with 6 different cancers.

Table 2 Average AUC of 10 different feature selection algorithms on 4 different microarray datasets

Method     Colon          Lung           DLBCL          Myeloma        Ave
...        .149           .685 ± .195    .770           .785 ±
InfoGain   .743 ±         .505 ± .121    .974 ± .028    .675 ± .050    .750
KW         .722 ± .198    .558 ± .184    .941 ± .051    .652 ± .037    .721
Relief     .728 ± .173    .523 ± .150    .980 ± .019    .698 ± .051    .757
mRMR       .743 ± .174    .505 ± .121    .975 ± .025    .677 ± .050    .751
SASMIF     .753 ± .149    .587 ± .175    .952 ± .038    .669 ± .054    .743
QPFS       .745 ± .175    .524 ± .153    .980 ± .017    .690 ± .047    .760
ST-BIP     .828 ± .063    .153 ± .192    .981 ± .020    .722 ± .078    .789

Table 3 Multi-Task Microarray Datasets (cancer/normal cases)

Bladder       Lung          Prostate      Breast
18 (11/7)     27 (20/7)     23 (14/9)     22 (17/5)

Renal         Colon         Pancreas      Uterus
24 (11/13)    26 (15/11)    21 (11/10)    17 (11/6)



Conclusion 

We proposed a novel feature filter method to select a fea- 
ture subset with discriminative power and minimal 
redundancy. The proposed feature selection method is 
based on a quadratic optimization problem with binary
integer constraints. It can be solved efficiently by relaxing
the binary integer constraints and applying a low-rank
approximation to the quadratic term in the objective. 
Furthermore, we extend our feature selection algorithm 
to multi-task classification problems. The empirical 
results on a number of microarray datasets show that in 
the single task scenario the proposed algorithm results in 
higher accuracy than the existing feature selection meth- 
ods. The results also suggest that our multitask feature 
selection algorithm can further improve the microarray 
classification performance. 

Methodology 

Feature selection by binary integer programming 

Let us denote the training dataset as $D = \{(x_i, y_i),\ i = 1, \ldots, N\}$,
where $x_i$ is an $M$-dimensional feature vector for
the i-th example, $y_i$ is its class label, and $N$ is the number
of training examples. Our objective is to select a feature
subset that is strongly predictive of the class label and
has low redundancy. We introduce a binary vector
$w = [w_1, w_2, \ldots, w_M]^T$ to indicate which features are
selected:

$$ w_j = \begin{cases} 1 & \text{if feature } j \text{ is selected}, \\ 0 & \text{if feature } j \text{ is not selected}. \end{cases} \qquad (1) $$

Table 4 Average AUC of 11 different feature selection algorithms on 8 different microarray datasets

Method      Blad    Breast  Colon   Lung    Panc    Pros.   Renal   Uterus  Ave
PC          .991    .696    .816    .703    .78     .603    .916    .883    .799
ChiSquare   .969    .625    .749    .669    .789    .636    .741    .908    .761
GINI        .969    .625    .749    .669    .789    .636    .743    .917    .762
InfoGain    .969    .625    .749    .669    .789    .636    .741    .908    .761
KW          .903    .621    .907    .750    .876    .626    .870    .913    .808
Relief      .991    .729    .795    .721    .796    .594    .929    .888    .805
mRMR        .969    .650    .765    .682    .830    .682    .786    .875    .780
SASMIF      .978    .739    .704    .671    .823    .650    .768    .854    .773
QPFS        .991    .693    .817    .700    .788    .600    .916    .883    .799
ST-BIP      .991    .679    .921    .782    .882    .612    .966    .910    .843
MT-BIP      .997    .850    .882    .754    .846    .715    .895    .921    .858

So, the new feature vector for the i-th example after feature
selection can be represented as $g_i = x_i \odot w$, where the
symbol $\odot$ denotes the element-wise product. Therefore, $g_{ij} = x_{ij}$
for $w_j = 1$ and $g_{ij} = 0$ for $w_j = 0$. Alternatively, $g_i$ can be
represented as $g_i = W x_i$, where $W$ is a diagonal matrix
whose diagonal is the vector $w$.
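
The two equivalent ways of writing the selected feature vector can be checked on a small example. This is a minimal sketch with hypothetical helper names (`mask_features`, `diag_times`) and made-up values; it only illustrates that the element-wise product with $w$ and the product with the diagonal matrix $W$ give the same vector.

```python
def mask_features(x, w):
    """Element-wise product g = x (.) w for a binary indicator vector w."""
    return [xj * wj for xj, wj in zip(x, w)]


def diag_times(w, x):
    """Same vector written as g = W x, with W = diag(w) built explicitly."""
    M = len(w)
    W = [[w[r] if r == c else 0 for c in range(M)] for r in range(M)]
    return [sum(W[r][c] * x[c] for c in range(M)) for r in range(M)]


x_i = [2.5, -1.0, 0.7, 3.2]   # one example with M = 4 features (toy values)
w = [1, 0, 0, 1]              # features 0 and 3 are selected
g_elementwise = mask_features(x_i, w)
g_matrix = diag_times(w, x_i)
print(g_elementwise, g_matrix)
```

In practice the element-wise form is preferable, since it avoids building an $M \times M$ matrix when $M$ is in the thousands.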

Intuitively, we would like the examples with the same 
class to be close (intra-class tightness) and the examples 
from different classes to be far away (inter-class separ- 
ability) in the spaces defined by selected features. The 
squared Euclidean distance between two examples $x_i$ and $x_j$ in
the new feature space can be calculated as

$$ d_{ij} = \| g_i - g_j \|^2 = \| x_i \odot w - x_j \odot w \|^2 = \| W x_i - W x_j \|^2 \qquad (2) $$



The inter-class separability of the data can be measured
by a sum of the pairwise distances between examples
with different class labels

$$ \sum_{y_i \neq y_j} \| x_i \odot w - x_j \odot w \|^2 \qquad (3) $$

Table 5 Top 10 enriched GO terms based on 100 MT-BIP selected genes

Enriched GO Term                     Hits   p-value    Disease Association
GO:0005856 cytoskeleton              21     7.49e-6    #BCLPPRU
GO:0043232 intracellular non-        29     1.79e-5    ########
GO:0043228 non-membrane-             29     1.79e-5    ########
GO:0003779 actin binding             10     5.35e-5    #BCLPP##
GO:0008092 cytoskeletal              12     6.41e-5    ########
GO:0030054 cell junction             11     2.31e-4    BBCLPP##
GO:0044459 plasma membrane part      24     2.53e-4    ##C#P###
GO:0005886 plasma membrane           32     1.09e-3    #BCLPPRU
GO:0015629 actin cytoskeleton        7      2.22e-3    BBCLPP##
GO:0032403 protein complex binding   6      3.83e-3    #BCLPPRU



The intra-class tightness of the data can be measured 
by a sum of the pair-wise distances between examples 
with the same class label 



$$ \sum_{y_i = y_j} \| x_i \odot w - x_j \odot w \|^2 \qquad (4) $$



Therefore, the problem of selecting a feature subset to 
maximize the intra-class tightness and inter-class separ- 
ability can be formulated as 



$$ \min_{w} \; \sum_{y_i = y_j} \| x_i \odot w - x_j \odot w \|^2 - \sum_{y_i \neq y_j} \| x_i \odot w - x_j \odot w \|^2. \qquad (5) $$



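Objective (5) can be rewritten as

$$ \min_{w} \; \sum_{i=1}^{N} \sum_{j=1}^{N} \| x_i \odot w - x_j \odot w \|^2 A_{ij}, \qquad (6) $$

where matrix $A$ is defined as:

$$ A_{ij} = \begin{cases} 1 & \text{if } y_i = y_j, \\ -1 & \text{if } y_i \neq y_j. \end{cases} \qquad (7) $$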
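
The equivalence between the difference of sums in (5) and the weighted double sum in (6) can be verified numerically. The sketch below is only an illustration under stated assumptions: the function names (`sq_dist`, `label_matrix`, `objective_5`, `objective_6`) and the toy data are hypothetical, and both sums run over ordered pairs $(i, j)$.

```python
def sq_dist(x_a, x_b, w):
    """Squared distance between two examples restricted to the selected features."""
    return sum((wj * (a - b)) ** 2 for a, b, wj in zip(x_a, x_b, w))


def label_matrix(y):
    """A_ij = +1 if examples i and j share a label, -1 otherwise (equation 7)."""
    return [[1 if yi == yj else -1 for yj in y] for yi in y]


def objective_6(X, y, w):
    """Weighted double sum over all ordered pairs (equation 6)."""
    A = label_matrix(y)
    N = len(X)
    return sum(sq_dist(X[i], X[j], w) * A[i][j] for i in range(N) for j in range(N))


def objective_5(X, y, w):
    """Intra-class sum minus inter-class sum (equation 5)."""
    N = len(X)
    pairs = [(i, j) for i in range(N) for j in range(N)]
    intra = sum(sq_dist(X[i], X[j], w) for i, j in pairs if y[i] == y[j])
    inter = sum(sq_dist(X[i], X[j], w) for i, j in pairs if y[i] != y[j])
    return intra - inter


# Toy data: feature 0 separates the classes, feature 1 does not.
X = [[1, 5, 0], [2, 4, 1], [8, 5, 0], [9, 4, 1]]
y = [0, 0, 1, 1]
w_a = [1, 0, 0]
w_b = [0, 1, 0]
print(objective_6(X, y, w_a), objective_6(X, y, w_b))
```

Selecting the discriminative feature yields a much lower (more negative) objective than selecting the uninformative one, which is the behavior the minimization is meant to reward.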



In addition to the objective (5) or (6), in order to 
improve the diversity of selected features, we would like 
to select a feature subset with minimal redundancy. A 
feature is defined to be redundant if there is another 
feature highly correlated with it. Let us denote $Q$ as a
symmetric positive semidefinite matrix of size $M \times M$,
whose element $Q_{ij}$ represents the similarity between
feature i and feature j. Since the measurements of each
feature across different samples are normally distributed,
it is reasonable to use Pearson correlation to measure
the similarity between two features here. Also, the similarity
matrix $Q$ is positive semidefinite when Pearson
correlation is used. Then, we define the redundancy
among the selected set of features represented by vector
$w$ as their average pairwise similarity $w^T Q w / m^2$, where
$m$ is the number of selected features. Our objective is to
minimize the redundancy defined in this way.

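The first contribution of this paper is to formulate the
feature selection task as a new quadratic programming
problem subject to binary integer and linear constraints as
follows,

$$
\begin{aligned}
\min_{w} \quad & \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \| x_i \odot w - x_j \odot w \|^2 A_{ij} + \lambda \frac{1}{m^2} w^T Q w \\
\text{s.t.} \quad & w_i \in \{0, 1\} \;\; \forall i \\
& \sum_{i=1}^{M} w_i = m.
\end{aligned}
\qquad (8)
$$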



The first term in (8), which is a linear term as shown in 
the following Proposition 1, tries to maximize the inter- 
class separability and intra-class tightness of the data. It 
describes the discriminative power of the selected feature 
subset. The second quadratic term is the average pair- 
wise similarity score between the selected features, which 
results in a reduction of feature redundancy. Parameter $\lambda$
is introduced to control the tradeoff between feature rele- 
vance and feature redundancy. Since Q is a positive semi- 
definite matrix, the proposed objective function is con- 
vex. The first constraint ensures that the resulting vector 
w is binary, while the second constraint ensures 
that exactly m features are selected. The following pro- 
position establishes that the first term in the objective 
(8) is linear. 

Proposition 1. The first term of the objective function
(8) can be written as a linear term $c^T w$, where $c$ is a vector
of size $M$ with elements $c_j = (X^T L X)_{jj}$, and $L$ is the Laplacian
matrix of $A$, defined as $L = D - A$. $D$ is a diagonal
degree matrix such that $D_{ii} = \sum_j A_{ij}$. $X$ is the $N \times M$
feature matrix; each row of $X$ corresponds to one example.
$(X^T L X)_{jj}$ denotes the j-th element on the diagonal of
the matrix $X^T L X$.

Proof. Let us denote $W$ as a diagonal matrix where $W_{jj} = w_j$.
Then,

$$ \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \| x_i \odot w - x_j \odot w \|^2 A_{ij} = \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \| W x_i - W x_j \|^2 A_{ij} = \mathrm{trace}(W^T X^T L X W) = \mathrm{trace}(X^T L X W W^T). $$

Because $w_j \in \{0, 1\}$, $W W^T = W$. Therefore,
$\mathrm{trace}(X^T L X W W^T) = \sum_{j=1}^{M} (X^T L X)_{jj} w_j = c^T w$, where $c_j = (X^T L X)_{jj}$. □
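
Proposition 1 can be checked numerically on a small example. The sketch below is an illustration only: the function names and toy data are hypothetical, and it assumes that the pairwise term carries a factor of 1/2, which is the scaling under which the Laplacian identity gives exactly $c^T w$ with $c_j = (X^T L X)_{jj}$.

```python
from itertools import product


def laplacian_coefficients(X, y):
    """c_j = (X^T L X)_jj with L = D - A, A_ik = +1 for same class and -1 otherwise."""
    N, M = len(X), len(X[0])
    A = [[1 if y[i] == y[k] else -1 for k in range(N)] for i in range(N)]
    # D is diagonal with the row sums of A, so L_ik = D_ik - A_ik.
    L = [[(sum(A[i]) if i == k else 0) - A[i][k] for k in range(N)] for i in range(N)]
    return [sum(X[i][j] * L[i][k] * X[k][j] for i in range(N) for k in range(N))
            for j in range(M)]


def half_pairwise_term(X, y, w):
    """(1/2) * sum_i sum_k ||x_i (.) w - x_k (.) w||^2 * A_ik, computed directly."""
    N = len(X)
    total = 0
    for i in range(N):
        for k in range(N):
            a_ik = 1 if y[i] == y[k] else -1
            total += a_ik * sum((wj * (xi - xk)) ** 2
                                for xi, xk, wj in zip(X[i], X[k], w))
    return total / 2


# Toy data with unbalanced classes, so that the degree matrix D is non-zero.
X = [[1, 5], [2, 4], [3, 5], [8, 4], [9, 5]]
y = [0, 0, 0, 1, 1]
c = laplacian_coefficients(X, y)

# The linear form c^T w matches the pairwise term for every binary w.
for w in product([0, 1], repeat=len(c)):
    linear = sum(cj * wj for cj, wj in zip(c, w))
    assert abs(linear - half_pairwise_term(X, y, w)) < 1e-9
print(c)
```

Once $c$ is computed, evaluating the discriminative term for any candidate subset costs only $O(M)$ instead of a double sum over all pairs of examples.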

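Based on Proposition 1, objective (8) can be rewritten
as the following constrained quadratic optimization problem.

$$
\begin{aligned}
\min_{w} \quad & c^T w + \lambda \frac{1}{m^2} w^T Q w \\
\text{s.t.} \quad & w_i \in \{0, 1\} \;\; \forall i \\
& \sum_{i=1}^{M} w_i = m.
\end{aligned}
\qquad (9)
$$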



There are two practical obstacles in solving (9): (1) the
binary constraint on the variable $w$, and (2) the feature similarity
matrix $Q$ has size $M \times M$, which implies high computational
cost for high-dimensional data. In the next
two sections, we will first relax the binary constraint, 
and then we will apply a low-rank approximation to Q. 
The resulting constrained optimization problem can be 
solved very efficiently, with linear time with respect to 
the number of features M. 

Problem relaxation. Due to the binary constraint on
the indicator vector $w$, it is difficult to solve (9) [9]. To
resolve this, we first relax the binary constraint on $w$ by
allowing its elements $w_j$ to be within the range $[0, m]$.
Then, (9) can be approximated by

$$
\begin{aligned}
\min_{w} \quad & c^T w + \lambda \frac{1}{m^2} w^T Q w \\
\text{s.t.} \quad & w_i \geq 0 \;\; \forall i \\
& \sum_{i=1}^{M} w_i = m.
\end{aligned}
\qquad (10)
$$



Now, (10) becomes a standard Quadratic Program- 
ming (QP) problem. The optimal solution can be 
obtained by a general QP solver (e.g., MOSEK [22]). 
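
To make the relaxed problem (10) concrete, the sketch below solves a tiny instance with projected gradient descent. This is not the solver used in the paper, which relies on a general QP solver such as MOSEK [22]; projected gradient is swapped in here only because it fits in a few lines of plain Python. The function names, the step-size rule, the iteration count and the toy values of $c$, $Q$, $\lambda$ and $m$ are all assumptions made for illustration.

```python
def project_to_scaled_simplex(v, m):
    """Euclidean projection of v onto {w : w_i >= 0, sum(w) = m}."""
    u = sorted(v, reverse=True)
    cumulative, theta = 0.0, 0.0
    for j, u_j in enumerate(u, start=1):
        cumulative += u_j
        t = (cumulative - m) / j
        if u_j - t > 0:          # holds for j = 1..rho; the last hit gives the threshold
            theta = t
    return [max(v_i - theta, 0.0) for v_i in v]


def objective(w, c, Q, lam, m):
    """c^T w + lam * w^T Q w / m^2, the objective of the relaxed problem."""
    M = len(w)
    quad = sum(w[i] * Q[i][j] * w[j] for i in range(M) for j in range(M))
    return sum(ci * wi for ci, wi in zip(c, w)) + lam * quad / m ** 2


def solve_relaxed_qp(c, Q, lam, m, iters=500):
    """Projected gradient descent started from the uniform feasible point."""
    M = len(c)
    scale = 2.0 * lam / m ** 2
    # Upper bound on the gradient's Lipschitz constant (largest absolute row sum of Q).
    lipschitz = scale * max(sum(abs(q) for q in row) for row in Q)
    step = 1.0 / lipschitz if lipschitz > 0 else 1.0
    w = [m / M] * M
    for _ in range(iters):
        grad = [c[i] + scale * sum(Q[i][j] * w[j] for j in range(M)) for i in range(M)]
        w = project_to_scaled_simplex([w[i] - step * grad[i] for i in range(M)], m)
    return w


# Toy instance: features 0 and 1 are both relevant (negative c) but highly similar.
c = [-3.0, -2.9, -1.0, -0.5]
Q = [
    [1.00, 0.90, 0.05, 0.03],
    [0.90, 1.00, 0.05, 0.03],
    [0.05, 0.05, 1.00, 0.02],
    [0.03, 0.03, 0.02, 1.00],
]
lam, m = 4.0, 2
w_start = [m / len(c)] * len(c)
w_relaxed = solve_relaxed_qp(c, Q, lam, m)
ranking = sorted(range(len(c)), key=lambda j: -w_relaxed[j])
print(w_relaxed, ranking[:m])
```

The relaxed weights are no longer binary, so the final feature subset is obtained by ranking the features by weight and keeping the top $m$.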

Low-rank approximation. The matrix $Q$ in (10) is of
size $M \times M$. So, it results in high time and space cost if
we work with high-dimensional microarray data. Therefore,
we would like to avoid the computational bottleneck
by using low-rank approximation techniques.

The matrix $Q$ in (10) is symmetric positive semidefinite.
So, it can be decomposed as $Q = U \Lambda U^T$, where $U$
is a matrix of eigenvectors and $\Lambda$ is a diagonal matrix
with the corresponding eigenvalues of $Q$. By setting
$\alpha = \Lambda^{1/2} U^T w$, we have that $w = U \Lambda^{-1/2} \alpha$. Therefore,
problem (10) can be rewritten as

$$
\begin{aligned}
\min_{\alpha} \quad & c^T U \Lambda^{-1/2} \alpha + \lambda \frac{1}{m^2} \alpha^T \alpha \\
\text{s.t.} \quad & U \Lambda^{-1/2} \alpha \geq 0 \\
& \mathbf{1}^T U \Lambda^{-1/2} \alpha = m.
\end{aligned}
\qquad (11)
$$



Typically, the rank of $Q$ (let us denote it as $k$) is much
smaller than $M$, $k \ll M$. Therefore, we can replace the
full eigenvector and eigenvalue matrices $U$ and $\Lambda$ by the
top $k$ eigenvectors and eigenvalues, resulting in an $M \times k$
matrix $U_k$ and a $k \times k$ diagonal matrix $\Lambda_k$, without losing
much information. Therefore, (11) is reformulated as

$$
\begin{aligned}
\min_{\alpha} \quad & c^T U_k \Lambda_k^{-1/2} \alpha + \lambda \frac{1}{m^2} \alpha^T \alpha \\
\text{s.t.} \quad & U_k \Lambda_k^{-1/2} \alpha \geq 0 \\
& \mathbf{1}^T U_k \Lambda_k^{-1/2} \alpha = m.
\end{aligned}
\qquad (12)
$$

Since $\alpha$ is a vector of length $k$, with $k \ll M$, the QP (11)
is reduced to a new QP in a k-dimensional space with
$M + 1$ constraints. Once the solution $\alpha$ of (12) is
obtained, the variable $w$ in the original space can be
approximated by $w \approx U_k \Lambda_k^{-1/2} \alpha$.

Decomposing matrix $Q$ requires $O(M^3)$ time, which is
expensive for microarray data, where $M$ is large. Next we
show how to efficiently compute the top $k$ eigenvectors
and eigenvalues using the Nyström approximation technique
[23]. The Nyström method approximates an $M \times M$
symmetric, positive semidefinite matrix $Q$ by

$$ Q \approx E_{M,k} W_{k,k}^{-1} E_{M,k}^T \qquad (13) $$

where $E_{M,k}$ denotes the sub-matrix of $Q$ created by
selecting $k$ of its columns, and $W_{k,k}$ is the sub-matrix that
corresponds to the intersection of the selected columns
and rows. Sampling schemes in the Nyström method
include random sampling [23], probabilistic sampling
[24], and k-means based sampling [13]. We chose
k-means sampling in our experiments because [13]
showed that it produces very good low-rank approximations
at a relatively low cost. Given (13), we can easily
obtain the low-rank approximation of $Q$ as

$$ Q \approx G G^T, \quad \text{where } G = E_{M,k} W_{k,k}^{-1/2}. \qquad (14) $$

As shown in the following Proposition 2, the top $k$
eigenvectors and eigenvalues can be computed in $O(Mk^2)$
time using the Nyström method, which is much more efficient
than the eigen-decomposition of $Q$, which requires
$O(M^3)$ time.
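
The Nyström formula in (13) can be illustrated on a small matrix. The sketch below is a simplified illustration rather than the procedure used in the experiments: the landmark columns are picked by hand instead of by k-means sampling, $k$ is fixed to 2 so that the inverse of the intersection block can be written explicitly, and the function name `nystrom_rank2` and the toy matrix are hypothetical.

```python
def nystrom_rank2(Q, landmarks):
    """Nystrom approximation E W^{-1} E^T built from exactly two landmark columns."""
    a, b = landmarks
    M = len(Q)
    E = [[Q[i][a], Q[i][b]] for i in range(M)]          # M x 2 block of selected columns
    W = [[Q[a][a], Q[a][b]], [Q[b][a], Q[b][b]]]        # 2 x 2 intersection block
    det = W[0][0] * W[1][1] - W[0][1] * W[1][0]
    if abs(det) < 1e-12:
        raise ValueError("landmark block W is singular")
    W_inv = [[W[1][1] / det, -W[0][1] / det],
             [-W[1][0] / det, W[0][0] / det]]
    EW = [[sum(E[i][p] * W_inv[p][q] for p in range(2)) for q in range(2)]
          for i in range(M)]
    return [[sum(EW[i][q] * E[j][q] for q in range(2)) for j in range(M)]
            for i in range(M)]


# Toy similarity matrix of rank 2: Q = F F^T with a 4 x 2 factor F.
F = [[1, 0], [0, 1], [1, 1], [2, -1]]
Q = [[sum(fi * fj for fi, fj in zip(F[i], F[j])) for j in range(4)] for i in range(4)]

Q_approx = nystrom_rank2(Q, landmarks=(2, 3))
max_error = max(abs(Q[i][j] - Q_approx[i][j]) for i in range(4) for j in range(4))
print(max_error)
```

When $Q$ has rank at most $k$ and the landmark block is non-singular, the approximation reproduces $Q$ up to rounding error while only $k$ columns of $Q$ are ever evaluated; for a full-rank $Q$ the quality depends on how the landmarks are chosen.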

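Proposition 2. The top $k$ eigenvectors $U_k$ and the corresponding
eigenvalues $\Lambda_k$ of $Q \approx G G^T$ can be approximated
as $\Lambda_k = \Lambda_G$ and $U_k = G U_G \Lambda_G^{-1/2}$, where $U_G$ and
$\Lambda_G$ are obtained by the eigen-decomposition of the $k \times k$
matrix $G^T G = U_G \Lambda_G U_G^T$.

Proof. First, we observe that $U_k$ contains orthonormal
columns.

$$
\begin{aligned}
U_k^T U_k &= \Lambda_G^{-1/2} U_G^T G^T G U_G \Lambda_G^{-1/2} \\
&= \Lambda_G^{-1/2} U_G^T U_G \Lambda_G U_G^T U_G \Lambda_G^{-1/2} = I.
\end{aligned}
$$

Next, we observe that

$$ U_k \Lambda_k U_k^T = G U_G \Lambda_G^{-1/2} \Lambda_G \Lambda_G^{-1/2} U_G^T G^T = G G^T \approx Q. $$

□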

Our proposed feature selection algorithm is summarized
in Algorithm 1. In Algorithm 1, steps 1 to 5 require
$O(Mk^2 + k^3)$ time. The QP in step 6 with $k$ variables has a
polynomial time complexity with respect to $k$. Step 7 requires
$O(Mk)$ time. Therefore, overall, the proposed feature selection
algorithm is very efficient and has linear time complexity
in the number of features $M$.

Algorithm 1 Single-Task Binary Integer Program Feature
Selection

Input: training data $X$, their labels $y$, regularization
parameter $\lambda$, number of features $m$, low-rank parameter $k$.
Output: $m$ selected features

1. Apply Proposition 1 to compute the vector $c$

2. Use k-means to select $k$ landmark features for the
low-rank approximation of $Q$

3. Compute $E_{M,k}$ and $W_{k,k}$ in (13)

4. Obtain the low-rank approximation of $Q$ by (14)

5. Apply Proposition 2 to compute the top $k$ eigenvalues
$\Lambda_k$ and eigenvectors $U_k$ of $Q$

6. Obtain $\alpha$ by solving the lower-dimensional QP
problem (12)

7. Obtain $w$ in the original feature space as $w \approx U_k \Lambda_k^{-1/2} \alpha$

8. Rank the features according to the weight vector $w$
and select the top $m$ features

Multi-task feature selection by binary integer 
programming 

Multi-task learning algorithms have been shown to be 
able to achieve significantly higher accuracy than single 
task learning algorithms both empirically [11] and theo- 
retically [25]. Motivated by these promising results, in 
this section, we extend our feature selection algorithm 
to the multi-task setting. The objective is to select fea- 
tures which are discriminative and non-redundant over 
multiple microarray datasets. 

Let us suppose there are $K$ different but similar classification
tasks, and denote the training data of the t-th
task as $D^t = \{(x_i^t, y_i^t),\ i = 1, \ldots, N^t\}$, where $N^t$ is the
number of training examples of the t-th task. [10,11]
proposed multi-task feature selection algorithms that
use the $\ell_{1,2}$ norm to regularize the linear model coefficients
$\beta$ across $K$ different classification tasks. The $\ell_{1,2}$ norm
regularizer over all $\beta$s across $K$ classification tasks can
be expressed as $\sum_{j=1}^{M} \big( \sum_{t=1}^{K} (\beta_j^t)^2 \big)^{1/2}$, where $\beta_j^t$ is the coefficient
of the j-th feature in the t-th task. Due to the $\ell_1$
norm on the $\ell_2$ norm of the group of coefficients of each
feature across $K$ tasks, the $\ell_{1,2}$ norm regularizer selects
the same feature subset across $K$ tasks. However, the
$\ell_{1,2}$ norm regularized problem is challenging to solve
because of the non-smoothness of the $\ell_{1,2}$ norm. In this
section, we show that our proposed feature
selection method can be easily extended to a multi-task learning
version. The resulting optimization problem
has the same form as objective (9), which can be solved
efficiently as shown in the previous section.

Let us denote $w^t$ as the binary indicator defined in (1)
to represent the selected feature subset of the t-th classification
task. If we do not consider the relatedness
between these $K$ classification tasks, the individual $w^t$ could
be obtained by applying Algorithm 1 to the different classification
tasks. Based on the conclusion given by [10,11],
it would be beneficial to select the same feature subset
across $K$ related classification tasks. In our case, this
can be achieved by setting $w^t = w \;\; \forall t$. Therefore, the
same feature subset across $K$ tasks, defined by vector $w$, can be
obtained by solving the following optimization problem,

$$
\begin{aligned}
\min_{w} \quad & \sum_{t=1}^{K} c_t^T w + \lambda \frac{1}{m^2} w^T \Big( \sum_{t=1}^{K} Q_t \Big) w \\
\text{s.t.} \quad & w_i \in \{0, 1\} \;\; \forall i \\
& \sum_{i=1}^{M} w_i = m.
\end{aligned}
\qquad (15)
$$

where $c_t$ and $Q_t$ are the linear and quadratic terms of the
QP corresponding to the t-th task. The details of how
to compute $c_t$ and $Q_t$ are explained in the previous section.
The technique of relaxing the binary integer constraints
and applying a low-rank approximation to $Q$ introduced in
the previous section can be used to solve (15). The
extended multi-task feature selection algorithm is also a
feature filter. It can be used in conjunction with any supervised
learning algorithm.
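
The observation that the multi-task problem has the same form as the single-task one can be made concrete: summing the per-task linear and quadratic terms gives one aggregated pair $(c, Q)$ that can be passed to the single-task machinery. The sketch below is a minimal illustration; the function names (`aggregate_tasks`, `multitask_objective`, `quad_form`) and the toy per-task values are hypothetical.

```python
def quad_form(Q, w):
    """w^T Q w for a square matrix given as a list of rows."""
    n = len(w)
    return sum(w[i] * Q[i][j] * w[j] for i in range(n) for j in range(n))


def multitask_objective(w, c_list, Q_list, lam, m):
    """Objective of the multi-task problem, summed task by task."""
    linear = sum(sum(cj * wj for cj, wj in zip(c, w)) for c in c_list)
    quadratic = sum(quad_form(Q, w) for Q in Q_list)
    return linear + lam * quadratic / m ** 2


def aggregate_tasks(c_list, Q_list):
    """Sum the per-task terms into one (c, Q) pair with the single-task form."""
    M = len(c_list[0])
    c_sum = [sum(c[j] for c in c_list) for j in range(M)]
    Q_sum = [[sum(Q[i][j] for Q in Q_list) for j in range(M)] for i in range(M)]
    return c_sum, Q_sum


# Two toy tasks over the same M = 3 features.
c_list = [[-4.0, -1.0, -2.0], [-3.0, -0.5, -2.5]]
Q_list = [
    [[1.0, 0.5, 0.0], [0.5, 1.0, 0.25], [0.0, 0.25, 1.0]],
    [[1.0, 0.25, 0.5], [0.25, 1.0, 0.0], [0.5, 0.0, 1.0]],
]
lam, m = 2.0, 2
c_sum, Q_sum = aggregate_tasks(c_list, Q_list)

w = [1, 0, 1]
direct = multitask_objective(w, c_list, Q_list, lam, m)
single_form = sum(cj * wj for cj, wj in zip(c_sum, w)) + lam * quad_form(Q_sum, w) / m ** 2
print(direct, single_form)
```

Because the aggregation happens once before optimization, the cost of solving the multi-task problem does not grow with the number of tasks beyond computing each task's terms.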